During the last four years, through AFOSR support, we have been privileged to have the
opportunity to work on the Natural Language Driven Scene Analyzer, known by the acronym
LandScan.


LandScan  was  intended  to  be  an  interface  not  to  a  database  but  to  an  active  (and  interactive) 
visual  recognition  system.  That  is,  rather  than  searching  a  body  of  existing  facts  about  the 
domain,  the  system  drives  a  vision  component  that  will  process  data  supplied  to  it  by  two 
cameras  and  respond  with  identification  and  analysis  of  objects  found  in  the  scene.  We  currently 
are  looking  at  a  scale  model  of  a  city  block  that  is  part  of  the  University  of  Pennsylvania  campus. 
Obviously,  knowledge  about  language,  the  world,  and  visual  properties  of  objects  is  needed  for 
this,  and  will  have  to  reside  in  the  various  components  of  the  system,  but  the  data  being  returned 
by  the  system  will  be  gathered  in  response  to  the  user’s  requests. 

One of the primary motivations of this research was to examine how the questions posed by
the user and transmitted to the vision system by the interface can constrain the amount of
image processing done. What we found, however, was that the questions that were
interesting to the natural language people, like Herskovits, implied very sophisticated visual
recognition that we, the vision people, could not deliver. In particular, the natural language
researchers were interested in fine details, such as discriminating between different types of
roofs and buildings, whereas the vision researchers were facing errors
due to poor illumination, non-robust edge detection, sparse data obtained by stereo
algorithms, and, in general, inconsistent segmentation of the scene that was to be interpreted by
the reasoner coming from the natural language end.

This realization redirected the focus of our research in the vision area for the past year (which
started in March 1987 and continues until March 1988). We have spent this time thoroughly
examining the parameters of the segmentation process.

What  we  have  learned  is: 

1.  Segmentation can be cast as a global process (performed on the whole picture)
with some given external parameters, i.e., topological parameters that are also the
goal of this process, and the noise of the camera system. This global process
invokes some local (neighborhood-bound) processes, such as edge detection and
boundary formation, and region growing, or grouping, of similar elements. These
local processes also have parameters, which can be built in a priori or can be
variable and computed by the segmentation processor. These parameters are: the size
of the desired bandpass for filtering; the range of the contrast, so that proper
discontinuities (edges) can be detected; and the local similarity, used to detect
continuities (regions).

2.  We are learning how the local parameters interact with the global
parameters for a given goal of the segmentation process. Clearly the result of the
segmentation, i.e., the segmented image, is not unique and hence must be
described, in addition to its geometry, with some degree of uncertainty. This is well
justified since the problem is unconstrained: there is just not enough information
(bottom-up or top-down). In fact, this is an example of the famous psychological
problem of Figure-Ground disambiguation.
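
As a rough illustration of the two points above, the following Python sketch runs two local processes (edge detection governed by a contrast threshold, and region growing governed by a similarity tolerance) over a whole toy image and shows that the resulting segmentation depends on the parameter settings. It is not the LandScan implementation: the function names, the toy image, and the parameter values are all assumptions made for illustration, and the bandpass filtering step is omitted for brevity.

```python
from collections import deque

# Hypothetical toy intensity image (illustrative only, not LandScan data).
IMAGE = [
    [10, 11, 12, 50, 52],
    [10, 12, 13, 51, 53],
    [30, 31, 13, 52, 90],
]


def detect_edges(image, min_contrast):
    """Local process: return neighbouring pixel pairs whose intensity
    step is at least min_contrast (a discontinuity)."""
    rows, cols = len(image), len(image[0])
    edges = []
    for r in range(rows):
        for c in range(cols):
            for nr, nc in ((r, c + 1), (r + 1, c)):  # right and down neighbours
                if nr < rows and nc < cols:
                    if abs(image[r][c] - image[nr][nc]) >= min_contrast:
                        edges.append(((r, c), (nr, nc)))
    return edges


def grow_regions(image, similarity):
    """Local process: group 4-connected pixels whose neighbour-to-neighbour
    intensity difference is at most `similarity` (a continuity)."""
    rows, cols = len(image), len(image[0])
    labels = [[-1] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != -1:
                continue  # pixel already belongs to a region
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and labels[ny][nx] == -1 \
                            and abs(image[ny][nx] - image[y][x]) <= similarity:
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels, next_label


def segment(image, min_contrast, similarity):
    """Global process: apply both local processes to the whole picture."""
    edges = detect_edges(image, min_contrast)
    labels, region_count = grow_regions(image, similarity)
    return {"edges": edges, "labels": labels, "region_count": region_count}


# The same picture yields different segmentations as the tolerance changes.
for similarity in (2, 25, 40):
    result = segment(IMAGE, min_contrast=30, similarity=similarity)
    print(similarity, result["region_count"], len(result["edges"]))
```

Loosening the similarity tolerance merges regions, so no single labelling can be called the segmentation of this picture. In this sketch the parameters are fixed by the caller (built in a priori); a segmentation processor that computes them would replace the constant arguments with values estimated from the image and the camera noise.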